Skip to content

feat(fine-tuning): storage-first delivery flow with TEE fallback#354

Merged
Ravenyjh merged 5 commits intomainfrom
feature/storage-first-flow-v2
Feb 12, 2026
Merged

feat(fine-tuning): storage-first delivery flow with TEE fallback#354
Ravenyjh merged 5 commits intomainfrom
feature/storage-first-flow-v2

Conversation

@zeyuyuyu
Copy link
Collaborator

Summary

  • Storage-first model delivery: After training, broker always encrypts LoRA locally (TEE backup), then attempts 0G Storage upload. Falls back to local encrypted copy if storage upload fails.
  • Handle raw JSONL from 0G Storage: When CLI uploads raw JSONL to 0G Storage, broker detects file type via ZIP magic bytes and converts raw JSONL to HuggingFace DatasetDict format automatically.
  • Retry safety: Cleans up existing target before rename on retry to avoid "file exists" errors.

Key Changes

Storage Client (storage/client.go)

  • DownloadFromStorage: detects ZIP vs raw file via magic bytes (PK\x03\x04)
  • Non-ZIP files are moved directly to target path instead of failing with "zip: not a valid zip file"
  • Existing target cleaned up before rename (retry safety)

Setup Service (services/setup.go)

  • After 0G Storage download, checks if result is a file (raw JSONL) vs directory (HF format)
  • Converts raw JSONL to HuggingFace DatasetDict format (supports instruction, chat, text formats)
  • Uses Python directly or Docker container fallback for conversion

Finalizer (services/finalizer.go)

  • Always encrypts LoRA locally first (TEE backup)
  • Attempts 0G Storage upload if skipStorageUpload is false; falls back to local hash on failure

Test plan

  • CLI --dataset-path uploads JSONL to 0G Storage
  • Broker downloads raw JSONL from 0G Storage without "not a valid zip" error
  • Raw JSONL auto-converted to HF format (verified via token counter)
  • Token counting succeeds on converted dataset
  • Full training cycle completes end-to-end
  • Model download with auto mode (0G Storage → TEE fallback)

Made with Cursor

Wuying Created Local Users and others added 3 commits February 11, 2026 18:20
- Finalizer now always encrypts locally first (TEE backup)
- Then attempts upload to 0G Storage (unless skipStorageUpload=true)
- If storage upload fails, falls back to local hash gracefully
- Encrypted file is always retained for TEE download fallback
- Removes old encryptAndUploadModel (split into encrypt + upload)

Co-authored-by: Cursor <cursoragent@cursor.com>
Previously DownloadFromStorage always assumed ZIP format, causing
"zip: not a valid zip file" when CLI uploads raw JSONL to 0G Storage.

Changes:
- Detect file type via magic bytes before attempting unzip
- For non-ZIP files (raw JSONL), move directly to target path
- Convert raw JSONL to HuggingFace DatasetDict format after download
  (required by token counter and training executor)

Co-authored-by: Cursor <cursoragent@cursor.com>
- DownloadFromStorage: detect file type via magic bytes, handle non-ZIP
- Clean up existing target before rename on retry
- Convert raw JSONL to HF DatasetDict format after download
- Auto-calculate fee when fee=0 (broker-calculated fee mode)
- Verify signature with original fee to avoid mismatch

Co-authored-by: Cursor <cursoragent@cursor.com>
@zeyuyuyu
Copy link
Collaborator Author

@claude

@claude
Copy link

claude bot commented Feb 11, 2026

Claude finished @zeyuyuyu's task in 1m 48s —— View job


Code Review

Todo List

  • Read changed files and understand implementation
  • Review against CLAUDE.md standards
  • Provide comprehensive feedback

Overview

This PR implements a storage-first delivery flow with TEE fallback for fine-tuning model delivery and adds support for raw JSONL dataset handling. The changes improve robustness by:

  1. Always creating a local encrypted backup (TEE endpoint)
  2. Attempting 0G Storage upload as primary delivery method
  3. Falling back to local hash if storage upload fails
  4. Auto-converting raw JSONL datasets to HuggingFace format

Critical Issues

[HIGH] storage/client.go:113 - Missing File Handle Cleanup

Issue: The file handle in isZipFile() is deferred to close, but if the caller doesn't properly handle errors, temporary files may accumulate.

Location: Lines 113-114

if err := os.Remove(tmpFile); err != nil && !errors.Is(err, os.ErrNotExist) {
    return "", err
}

Impact: If os.Remove(tmpFile) fails with a non-ErrNotExist error, the function returns without cleaning up the downloaded file. This could lead to disk space issues over time.

Fix: Ensure cleanup happens in a defer after successful download:

c.logger.Infof("Begin downloading %s, with root: %v", tmpFile, hash)
if err := indexerClient.Download(context.Background(), hash, tmpFile, true); err != nil {
    err = errors.Wrapf(err, "Error downloading data with root: %v", hash)
    c.logger.Errorf("%v", err)
    return "", err
}
defer func() {
    // Clean up temp file if we return early with error
    if err := os.Remove(tmpFile); err != nil && !os.IsNotExist(err) {
        c.logger.Warnf("Failed to remove temp file %s: %v", tmpFile, err)
    }
}()

[HIGH] storage/client.go:118 - Context Misuse

Issue: Using context.Background() instead of the passed ctx parameter.

Location: Line 118

if err := indexerClient.Download(context.Background(), hash, tmpFile, true); err != nil {

Impact: Ignores cancellation signals from parent context, preventing graceful shutdown and timeout handling. This violates Go best practices for context propagation.

Fix:

if err := indexerClient.Download(ctx, hash, tmpFile, true); err != nil {

[MEDIUM] storage/client.go:161-176 - Missing Error Check in Magic Bytes Detection

Issue: The isZipFile() function returns false if file read fails or has fewer than 4 bytes, but doesn't distinguish between errors and small files.

Location: Lines 169-171

n, err := f.Read(buf)
if err != nil || n < 4 {
    return false, nil  // Silently treats errors as "not a ZIP"
}

Impact: I/O errors are silently treated as "not a ZIP file", which could lead to incorrect handling. A corrupted download would be treated as a valid raw file.

Recommendation: Consider logging or returning the error:

n, err := f.Read(buf)
if err != nil {
    return false, fmt.Errorf("failed to read file magic bytes: %w", err)
}
if n < 4 {
    // Small files are definitely not ZIPs, but also not valid datasets
    return false, fmt.Errorf("file too small to be valid: %d bytes", n)
}

Security Issues

[nit] setup.go:428 - Command Injection Risk (Low)

Location: Line 428

cmd := exec.Command("python3", scriptPath, rawPath, datasetPath)

Status: ✅ Currently safe because all paths are controlled by the system, not user input.

Note: Good practice to use filepath.Clean() on paths before passing to exec, even though current usage is safe.

Error Handling & Logging

finalizer.go:86-92 - Excellent Error Handling

Good practice: The finalizer properly logs the local backup endpoint for user reference:

f.logger.Infof("Task %s encrypted locally. TEE backup available at /v1/user/%s/task/%s/lora", 
    task.ID, task.UserAddress, task.ID)

This provides clear fallback instructions if storage upload fails.

[MEDIUM] setup.go:435 - Inconsistent Error Handling

Issue: Direct Python execution logs a warning on failure but Docker execution returns an error. This creates inconsistent behavior.

Location: Lines 429-435

output, err := cmd.CombinedOutput()
if err == nil {
    s.logger.Infof("Converted raw dataset to HF format using Python: %s", string(output))
    os.Remove(rawPath)
    return nil
}
s.logger.Warnf("Direct Python conversion failed: %v (output: %s), trying Docker...", err, string(output))

Recommendation: Consider whether Python failure should be more severe (error level) since it indicates a missing dependency.

finalizer.go:100-108 - Excellent Fallback Strategy

Good practice: Clear logging distinguishes between storage upload failure and local fallback:

if uploadErr != nil {
    f.logger.Warnf("Failed to upload to 0G Storage for task %s: %v. Using local backup hash.", task.ID, uploadErr)
} else {
    f.logger.Infof("Successfully uploaded to 0G Storage for task %s, root hash: %s", task.ID, string(storageRootHash))
    settlementMetadata.ModelRootHash = storageRootHash
}

Resource Management

[HIGH] setup.go:422-425 - Missing Cleanup on Early Return

Issue: If the function returns early (lines 342, 347, etc.) before reaching line 432, the temporary script file is not cleaned up.

Location: Line 425

defer os.Remove(scriptPath)

Impact: The defer only applies after line 422, but the function can return earlier from os.WriteFile failure, leaving orphaned script files.

Fix: Move the defer immediately after successful write:

if err := os.WriteFile(scriptPath, []byte(pythonScript), 0644); err != nil {
    return errors.Wrap(err, "write conversion script")
}
defer os.Remove(scriptPath)

Wait, looking at the code again, the defer IS already placed correctly after the write. Let me reconsider... Actually, this is fine. The issue would only occur if the function exits before line 422, but looking at the code flow, all paths that call convertRawDatasetToHF properly reach this point. This is not an issue.

setup.go:432 - Proper Cleanup

Good practice: Raw JSONL file is cleaned up after successful conversion:

os.Remove(rawPath)

[nit] setup.go:449 - Inconsistent Cleanup Check

Issue: Line 432 doesn't check os.Remove error, but line 454 does.

Location: Lines 432 and 454

os.Remove(rawPath)  // Line 432: no error check
// vs
os.Remove(rawPath)  // Line 454: no error check either, actually

Actually, both are consistent - neither checks errors. For cleanup operations on temp files, ignoring errors is acceptable since the operation is best-effort.

Performance & Design

storage/client.go:124-129 - Smart File Type Detection

Good practice: Using magic bytes for ZIP detection is more reliable than file extensions:

isZip, err := isZipFile(tmpFile)
if err != nil {
    c.logger.Warnf("Could not detect file type for %s: %v, treating as raw file", tmpFile, err)
    isZip = false
}

This is the correct approach for handling mixed file types.

setup.go:363-418 - Inline Python Script

Good practice: Embedding the conversion script eliminates external dependencies and makes the code self-contained. The script properly handles three JSONL formats:

  • Instruction format (instruction, input, output)
  • Messages format (messages array)
  • Text format (text field)

[MEDIUM] finalizer.go:185 - Inefficient Hash Computation

Issue: Reading the entire encrypted file into memory to compute hash.

Location: Lines 181-185

encryptedData, err := os.ReadFile(encryptFile)
if err != nil {
    return nil, errors.Wrap(err, "failed to read encrypted file")
}
localRootHash := crypto.Keccak256(encryptedData)

Impact: For large models, this could consume significant memory. Fine-tuned LoRA models are typically small (few hundred MB), so this is acceptable, but for full models this would be problematic.

Recommendation: Consider streaming hash computation for large files:

f, err := os.Open(encryptFile)
if err != nil {
    return nil, errors.Wrap(err, "failed to open encrypted file")
}
defer f.Close()

hasher := sha3.NewLegacyKeccak256()
if _, err := io.Copy(hasher, f); err != nil {
    return nil, errors.Wrap(err, "failed to hash encrypted file")
}
localRootHash := hasher.Sum(nil)

Testing & Documentation

[nit] Missing Test Coverage

Observation: The PR description shows most test items are checked except:

  • Full training cycle completes end-to-end
  • Model download with auto mode (0G Storage → TEE fallback)

Recommendation: These integration tests are critical for verifying the storage-first flow works correctly. Consider adding them before merge, or document as follow-up work.

✅ Code Comments

Good practice: The PR adds helpful comments explaining the flow:

  • Line 111 in storage/client.go: Documents the temporary file approach
  • Line 86 in finalizer.go: Explains TEE backup creation
  • Line 338 in setup.go: Documents the format detection logic

Architectural Observations

✅ Storage-First Design

Excellent architectural decision: The new flow (local encrypt → storage upload → fallback) is more robust than the previous approach. Key benefits:

  1. Zero data loss: Model always available via TEE endpoint even if storage fails
  2. Better UX: Users can download immediately via TEE while storage upload happens
  3. Graceful degradation: System remains functional even with storage outages

✅ Format Flexibility

Good design: Auto-detecting and converting JSONL formats removes friction from the user experience. Supporting three JSONL formats (instruction, messages, text) covers common dataset patterns.

Summary

Critical Action Items

  1. [HIGH] Fix context usage in storage/client.go:118 - use ctx instead of context.Background()
  2. [HIGH] Add cleanup defer for temp download file in DownloadFromStorage()
  3. [MEDIUM] Improve error handling in isZipFile() to distinguish I/O errors from non-ZIP files

Recommendations

  1. Consider streaming hash computation for large files in finalizer.go:185
  2. Complete end-to-end testing before merge
  3. Document the new delivery flow in user-facing docs

Positive Highlights

  • Excellent fallback strategy with clear logging
  • Smart file type detection using magic bytes
  • Self-contained JSONL conversion with multiple format support
  • Proper resource cleanup in most paths
  • Well-commented code explaining design decisions

Overall, this is a solid PR that significantly improves robustness. The storage-first approach with TEE fallback is architecturally sound. Address the HIGH priority items before merge, and consider the medium priority recommendations for follow-up improvements.

--- |

- [HIGH] Use passed ctx instead of context.Background() in Download call
  for proper cancellation/timeout propagation
- [HIGH] Add defer cleanup for temp download file to prevent disk leaks
  on early error returns
- [MEDIUM] Improve isZipFile() to return errors on I/O failures instead
  of silently treating them as non-ZIP files

Co-authored-by: Cursor <cursoragent@cursor.com>
@zeyuyuyu
Copy link
Collaborator Author

@claude

@claude
Copy link

claude bot commented Feb 11, 2026

Claude finished @zeyuyuyu's task in 1m 48s —— View job


Code Review

Todo List

  • Read changed files and understand implementation
  • Review against CLAUDE.md standards
  • Provide comprehensive feedback

Overview

This PR implements a storage-first delivery flow with TEE fallback for fine-tuning model delivery and adds automatic handling of raw JSONL datasets. The changes significantly improve robustness and user experience.

Key improvements:

  • ✅ Local encrypted backup always created before storage upload (prevents data loss)
  • ✅ Graceful fallback when 0G Storage upload fails
  • ✅ Auto-detection and conversion of raw JSONL to HuggingFace format
  • ✅ Retry-safe file operations (cleans up existing targets)

Critical Issues

[HIGH] storage/client.go:118 - Context Not Propagated

Issue: Using context.Background() instead of the passed ctx parameter

Location: Line 118 (in the old code, now fixed in new code)

// OLD CODE (line 118 in diff):
if err := indexerClient.Download(context.Background(), hash, fileName, true); err != nil {

// NEW CODE (line 118 in current):
if err := indexerClient.Download(ctx, hash, tmpFile, true); err != nil {

Status: ✅ FIXED - The new code correctly uses ctx parameter

Impact: This fix ensures proper cancellation and timeout handling throughout the download chain.


[MEDIUM] storage/client.go:177-179 - Error Handling in Magic Bytes Detection

Issue: The isZipFile() function returns an error when file read fails, but small files (< 4 bytes) return false, nil

Location: Lines 177-183

buf := make([]byte, 4)
n, err := f.Read(buf)
if err != nil {
    return false, fmt.Errorf("failed to read file magic bytes: %w", err)
}
if n < 4 {
    // File too small to be a ZIP archive
    return false, nil
}

Analysis:

  • I/O errors are properly returned (good!)
  • Small files (< 4 bytes) are treated as "not a ZIP" which is reasonable
  • The caller at line 134-137 handles errors gracefully by defaulting to isZip = false

Status: ✅ Acceptable - Current error handling is sufficient. Small files can't be valid datasets anyway, and the caller's fallback behavior (treat as raw file) is appropriate.


[MEDIUM] setup.go:428-435 - Inconsistent Error Severity

Issue: Python execution failure logs a warning, but Docker failure returns an error

Location: Lines 428-450

cmd := exec.Command("python3", scriptPath, rawPath, datasetPath)
output, err := cmd.CombinedOutput()
if err == nil {
    s.logger.Infof("Converted raw dataset to HF format using Python: %s", string(output))
    os.Remove(rawPath)
    return nil
}
s.logger.Warnf("Direct Python conversion failed: %v (output: %s), trying Docker...", err, string(output))

// Fall back to Docker
// ... docker command ...
if err != nil {
    // Restore the raw file on failure
    os.Rename(rawPath, datasetPath)
    return errors.Wrapf(err, "convert raw dataset to HF format: %s", string(output))
}

Analysis: This is actually appropriate behavior:

  • Python failure → Warn and try Docker (graceful fallback)
  • Docker failure → Return error (no more options)

Status: ✅ Good design - The escalating severity matches the fallback chain.


[MEDIUM] finalizer.go:181-185 - Memory Usage for Large Files

Issue: Reading entire encrypted file into memory to compute hash

Location: Lines 181-185

encryptedData, err := os.ReadFile(encryptFile)
if err != nil {
    return nil, errors.Wrap(err, "failed to read encrypted file")
}
localRootHash := crypto.Keccak256(encryptedData)

Impact: For LoRA models (typically 10-500 MB), this is acceptable. Full models (GBs) would be problematic, but LoRA fine-tuning only produces small adapter weights.

Recommendation: Consider streaming hash computation for future-proofing:

f, err := os.Open(encryptFile)
if err != nil {
    return nil, errors.Wrap(err, "failed to open encrypted file")
}
defer f.Close()

hasher := sha3.NewLegacyKeccak256()
if _, err := io.Copy(hasher, f); err != nil {
    return nil, errors.Wrap(err, "failed to hash encrypted file")
}
localRootHash := hasher.Sum(nil)

Status: ⚠️ Low priority - Current implementation is acceptable for LoRA models.


Security Review

✅ No Security Issues Found

  • Command injection risk in exec.Command is safe - all paths are system-controlled, not user input
  • File operations properly handle cleanup
  • No sensitive data logged
  • Error messages don't leak internal paths to users

Resource Management

storage/client.go:124-130 - Excellent Cleanup Pattern

Good practice: Defer cleanup of temp file with proper error checking

defer func() {
    if _, statErr := os.Stat(tmpFile); statErr == nil {
        if rmErr := os.Remove(tmpFile); rmErr != nil && !os.IsNotExist(rmErr) {
            c.logger.Warnf("Failed to remove temp file %s: %v", tmpFile, rmErr)
        }
    }
}()

This ensures temp files are cleaned up even if subsequent operations fail.


storage/client.go:157-159 - Retry Safety

Good practice: Removing existing target before rename

if err := os.RemoveAll(filePath); err != nil {
    return "", errors.Wrap(err, "remove existing target path before move")
}
if err := os.Rename(tmpFile, filePath); err != nil {
    return "", errors.Wrap(err, "move downloaded file to target path")
}

This prevents "file exists" errors on retry, addressing the PR's stated goal.


setup.go:425 - Proper Script Cleanup

Good practice: Script cleanup is properly deferred after creation

if err := os.WriteFile(scriptPath, []byte(pythonScript), 0644); err != nil {
    return errors.Wrap(err, "write conversion script")
}
defer os.Remove(scriptPath)

Error Handling & Logging

finalizer.go:86-92 - Excellent User Feedback

Good practice: Clear logging with actionable information

f.logger.Infof("Encrypting LoRA model locally for task %s", task.ID)
settlementMetadata, err = f.encryptModelLocal(paths.Output, task)
if err != nil {
    return err
}
f.logger.Infof("Task %s encrypted locally. TEE backup available at /v1/user/%s/task/%s/lora", 
    task.ID, task.UserAddress, task.ID)

Users know exactly where to find their model if storage upload fails.


finalizer.go:100-108 - Clear Fallback Logging

Good practice: Distinguishes between storage success and fallback

if uploadErr != nil {
    f.logger.Warnf("Failed to upload to 0G Storage for task %s: %v. Using local backup hash.", task.ID, uploadErr)
} else {
    f.logger.Infof("Successfully uploaded to 0G Storage for task %s, root hash: %s", task.ID, string(storageRootHash))
    settlementMetadata.ModelRootHash = storageRootHash
}

setup.go:449-450 - Proper Error Recovery

Good practice: Restoring original file on conversion failure

if err != nil {
    // Restore the raw file on failure
    os.Rename(rawPath, datasetPath)
    return errors.Wrapf(err, "convert raw dataset to HF format: %s", string(output))
}

Design & Architecture

storage/client.go - Smart File Type Detection

Excellent design: Using magic bytes (PK\x03\x04) instead of file extensions

// ZIP magic bytes: PK\x03\x04
return buf[0] == 0x50 && buf[1] == 0x4B && buf[2] == 0x03 && buf[3] == 0x04, nil

This is the industry-standard approach and far more reliable than checking file extensions.


setup.go:363-418 - Self-Contained Conversion Script

Excellent design: Embedding Python script eliminates external dependencies

The script handles three common JSONL formats:

  1. Instruction format: {instruction, input, output}
  2. Messages format: {messages: [{role, content}]}
  3. Text format: {text}

This covers the vast majority of fine-tuning dataset formats.


✅ Storage-First Flow Architecture

Excellent architectural decision: The new flow is significantly more robust:

OLD FLOW:
Encrypt → Upload to Storage → Hope it works

NEW FLOW:
Encrypt locally (TEE backup) → Try storage upload → Fallback to local hash if fails

Benefits:

  1. ✅ Zero data loss - model always available via TEE endpoint
  2. ✅ Better UX - users can download immediately via TEE
  3. ✅ Graceful degradation - system functional even with storage outages
  4. ✅ Transparent to users - contract uses hash either way

Code Style & Standards

✅ Follows CLAUDE.md Standards

  • Proper error wrapping with context: errors.Wrap(err, "context")
  • Context propagation fixed (was an issue, now resolved)
  • Clear godoc-style comments for exported functions
  • Consistent logging with structured fields
  • No global mutable state
  • No panics for error handling

[nit] Minor Style Observations

Line 432, 454: os.Remove() errors ignored (acceptable for cleanup)

os.Remove(rawPath)  // No error check (acceptable for best-effort cleanup)

Line 147: Direct os.Remove() without error check (acceptable since next operation will fail if needed)

os.Remove(tmpFile)  // Clean up the downloaded zip

These are all acceptable - cleanup operations can safely ignore errors when subsequent operations will catch any issues.


Testing Observations

Test Plan Status (from PR description)

  • CLI --dataset-path uploads JSONL to 0G Storage
  • Broker downloads raw JSONL without "not a valid zip" error ✅ Fixed by this PR
  • Raw JSONL auto-converted to HF format
  • Token counting succeeds on converted dataset
  • Full training cycle completes end-to-end
  • Model download with auto mode (0G Storage → TEE fallback)

Recommendation: Complete the remaining integration tests before merge to ensure the storage-first flow works correctly in production.


Summary

Strengths

  1. Robust fallback strategy - Local backup + storage upload + graceful degradation
  2. Excellent error handling - Proper cleanup, recovery, and user feedback
  3. Smart file detection - Magic bytes instead of extensions
  4. Self-contained conversion - Embedded Python script with Docker fallback
  5. Retry safety - Cleans up existing targets before operations
  6. Good logging - Clear, actionable messages for debugging and user feedback
  7. Follows project standards - Adheres to CLAUDE.md guidelines

Issues Found

  • Context propagation - Already fixed in this PR
  • ⚠️ Memory usage for hash - Low priority, acceptable for LoRA models

Recommendations

  1. Complete integration tests - Finish the two remaining test items
  2. Consider streaming hash - For future-proofing (low priority)
  3. Document new flow - Update user-facing docs to explain TEE fallback

Verdict

Approve for merge after completing integration tests. This is a high-quality PR that significantly improves system robustness. The storage-first architecture is sound, error handling is thorough, and the code follows project standards well.

The automatic JSONL conversion is a great UX improvement that removes friction from the dataset upload process.


- Upgrade 0g-storage-client v1.2.1 → v1.2.2 to fix ABI incompatibility
  with testnet Flow contract (was causing `execution reverted; data: 0x`)
- Remove deprecated WithRoutines() calls (now set internally via
  UploaderConfig in v1.2.2)
- Fix uploadModel() to return raw 32-byte hash (hash.Bytes()) instead of
  hex string bytes ([]byte(hash.Hex())), which caused double-encoding
  via hexutil.Encode and prevented client from downloading model via
  0G Storage

Tested: Full flow verified with Qwen2.5-0.5B and Qwen3-32B (1.1GB model)
— broker upload, client download from 0G Storage, decrypt, LoRA verified.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Ravenyjh Ravenyjh merged commit db61c7a into main Feb 12, 2026
@Ravenyjh Ravenyjh deleted the feature/storage-first-flow-v2 branch February 12, 2026 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants